Supervised representation learning based on various levels of pediatric radiographic views for transfer learning

Transfer learning plays a pivotal role in addressing the paucity of data, expediting training processes, and enhancing model performance. Nonetheless, the prevailing practice of transfer learning predominantly relies on pre-trained models designed for the natural image domain, which may not be well-suited for the medical image domain in grayscale. Recognizing the significance of leveraging transfer learning in medical research, we undertook the construction of class-balanced pediatric radiograph datasets collectively referred to as PedXnets, grounded in radiographic views using the pediatric radiographs collected over 24 years at Asan Medical Center. For PedXnets pre-training, approximately 70,000 X-ray images were utilized. Three different pre-training weights of PedXnet were constructed using Inception V3 for various radiation perspective classifications: Model-PedXnet-7C, Model-PedXnet-30C, and Model-PedXnet-68C. We validated the transferability and positive effects of transfer learning of PedXnets through pediatric downstream tasks including fracture classification and bone age assessment (BAA). The evaluation of transfer learning effects through classification and regression metrics showed superior performance of Model-PedXnets in quantitative assessments. Additionally, visual analyses confirmed that the Model-PedXnets were more focused on meaningful regions of interest.


Upstream: pediatric dataset
A total of 2,598,404 pediatric radiographs were retrospectively collected from 1995 to 2018 at Asan Medical Center (AMC); we refer to this collection as the raw original dataset (see Fig. 1). Patient ages in the original dataset ranged from 0 to 18 years. To reflect the actual frequency of occurrence in the medical center, we divided the original dataset into training and validation sets by a reference date, July 2018. The validation set consists of 81,131 radiographs acquired from July 2018 to December 2018, and the training set comprises the remaining 2,499,598 radiographs. The original dataset was severely imbalanced by prescription code, and applying our proposed radiographic views labeling made the class imbalance even more severe. To address this, we under-sampled the data according to the least frequent class and approximately matched the total number of images for a fair comparison between the PedXnet-7C, PedXnet-30C, and PedXnet-68C datasets. After sampling, the PedXnet-7C dataset consisted of 70,000 radiographs (10,000 per class), the PedXnet-30C dataset of 69,000 radiographs (2,300 per class), and the PedXnet-68C dataset of 68,000 radiographs (1,000 per class). Additionally, we constructed a fine-tuning set for hyper-parameter tuning by separating 10% per class from each training set. Class details of the upstream datasets for each radiographic views labeling type are given in Supplementary Tables 1-3, and the baseline characteristics of the upstream dataset are given in Supplementary Table 5.
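The class-balancing step can be illustrated with a minimal sketch of random under-sampling. The function name `undersample_balanced`, its parameters, and the toy labels are assumptions made for illustration; the text does not describe the actual sampling code.

```python
import random
from collections import defaultdict


def undersample_balanced(labels, per_class=None, seed=0):
    """Return sorted indices with the same number of samples for every class.

    labels: one class label per radiograph (hypothetical input).
    per_class: samples kept per class; defaults to the rarest class count.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    smallest = min(len(indices) for indices in by_class.values())
    n = smallest if per_class is None else min(per_class, smallest)

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    kept = []
    for label in sorted(by_class):
        kept.extend(rng.sample(by_class[label], n))
    return sorted(kept)


# Toy, imbalanced label list (illustrative only, not the AMC data).
toy_labels = ["chest"] * 50 + ["head"] * 8 + ["spine"] * 5
balanced = undersample_balanced(toy_labels)
```

With the toy labels, every class is reduced to the size of the rarest one (five samples here), mirroring how each PedXnet dataset keeps a fixed number of radiographs per class.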

Downstream: fracture dataset
We utilized the publicly available GRAZPEDWRI-DX dataset 21, comprising annotated pediatric trauma wrist radiographs from 6,091 patients who received treatment in the Department of Pediatric Surgery at the University Hospital Graz between 2008 and 2018. This dataset contains 20,327 images, predominantly posteroanterior and lateral views. It represents a wide range of patient demographics, with a mean age of 10.9 years (range, 0.2 to 19 years; 2,688 females, 3,402 males, and one individual of unknown gender). To create a binary classification dataset, we used the 'fracture visible' column of the annotations to separate the images into 'fracture' and 'no fracture' categories. The ratio of fractures to non-fractures was established at 2:1. The dataset was randomly divided into training, fine-tuning, and validation sets at a 3:1:1 ratio.
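A minimal sketch of the labeling and splitting steps follows. The helper names and the encoding of the 'fracture visible' value (1 for a visible fracture, anything else for none) are assumptions for illustration, not the dataset's documented schema or the authors' code.

```python
import random


def to_binary_label(fracture_visible):
    """Map an annotation value to 1 (fracture) or 0 (no fracture).

    Assumption: a value of 1 marks a visible fracture; empty or missing
    values are treated as 'no fracture'.
    """
    return 1 if str(fracture_visible).strip() in {"1", "1.0"} else 0


def split_3_1_1(items, seed=0):
    """Randomly split items into training, fine-tuning, and validation sets (3:1:1)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 3 // 5
    n_tune = n // 5
    train = shuffled[:n_train]
    tune = shuffled[n_train:n_train + n_tune]
    valid = shuffled[n_train + n_tune:]
    return train, tune, valid


# Toy example with 20 hypothetical image identifiers.
train, tune, valid = split_3_1_1(range(20))
```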

Downstream: bone age prediction dataset
The dataset was released in the RSNA Pediatric Bone Age Challenge (2017). The organizers provided the lists of training, fine-tuning, and validation sets. According to Halabi et al. 10, the training and fine-tuning sets had similar age distributions, with averages of 127.321 and 127.156 months, respectively, and the validation set had an average age of 132.096 months. Radiographs for the training and fine-tuning sets were obtained from Children's Hospital Colorado (Aurora, Colo) and Lucile Packard Children's Hospital at Stanford. Pediatric radiographs for the validation set were collected from Lucile Packard Children's Hospital. The radiographs were provided with skeletal age estimates and sex from the clinical radiology report issued at the time of imaging. Reviewers used the Greulich and Pyle standard method (G-P method) 22 to determine the ground truth bone age.

Preprocessing
For each image, two simple pre-processing methods were applied. First, min-max normalization with 0.5% clipping of the upper and lower bounds was performed to suppress the effect of the L/R mark in radiographs and to remove outlier pixel values. We used the raw DICOM (Digital Imaging and Communications in Medicine) files as is, so the min-max normalization was applied to the stored bit value range. With X and Z denoting the sets of pixel values of the original (clipped) and scaled images, respectively, min-max normalization is defined as:
$$Z = \frac{X - \min(X)}{\max(X) - \min(X)}$$
Second, due to limited GPU resources, all images were resized down to 512 × 512 by bi-cubic interpolation while keeping the aspect ratio. The image size follows Kim et al. 23. Pediatric radiographs are acquired under various radiographic view protocols, and body size varies with age. Thus, we used strong image augmentations to alleviate the heterogeneity of the pediatric radiographs and to make the model robust to pediatric radiograph protocols in various anatomic locations. We used the image augmentation library Albumentations 24 and adopted the following eight augmentation methods: ShiftScaleRotate, HorizontalFlip, RandomBrightness, RandomContrast, RandomGamma, GaussNoise, Sharpen, and RandomBlur.
Considering previous studies 25,26, we set an appropriate batch size for each of the upstream and downstream tasks: 60 and 20, respectively. Each model was initialized with Xavier uniform initialization and trained with an Adam optimizer, a learning rate of 1e−4 with a warm-up of 5 epochs, a weight decay of 5e−4, and betas of (0.9, 0.999). The learning rate was reduced during training following the polynomial learning rate schedule $(1 - \mathrm{epoch}/\mathrm{epoch}_{\max})^{0.9}$. The total number of epochs was up to 500; however, the model selected in each experiment was the converged model with the highest validation score. All our models were implemented in Python version 3.6.9 with PyTorch version 1.6.0, accelerated by an NVIDIA TITAN RTX 24 GB graphics processing unit (GPU).
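The clipped min-max normalization and the polynomial learning rate schedule can be sketched in plain Python as below. The function names are hypothetical, the percentile rule is a simple nearest-rank approximation, and the linear shape of the 5-epoch warm-up is an assumption, since the text does not specify how the warm-up is combined with the polynomial decay.

```python
def clip_minmax_normalize(pixels, clip=0.005):
    """Clip the lowest and highest 0.5% of pixel values, then scale to [0, 1]."""
    ordered = sorted(pixels)
    n = len(ordered)
    lo = ordered[int(clip * (n - 1))]        # approximate lower 0.5% bound
    hi = ordered[int((1 - clip) * (n - 1))]  # approximate upper 0.5% bound
    if hi == lo:
        return [0.0 for _ in pixels]         # flat image: avoid division by zero
    return [(min(max(p, lo), hi) - lo) / (hi - lo) for p in pixels]


def poly_lr(epoch, base_lr=1e-4, max_epoch=500, warmup=5, power=0.9):
    """Learning rate at a given epoch.

    Assumption: linear warm-up over the first `warmup` epochs, followed by
    base_lr * (1 - epoch / max_epoch) ** power.
    """
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    return base_lr * (1 - epoch / max_epoch) ** power


# Toy "image": a smooth ramp plus one very bright outlier (e.g. an L/R marker).
toy_pixels = list(range(1000)) + [65535]
scaled = clip_minmax_normalize(toy_pixels)
```

Because of the clipping, the single outlier saturates at 1.0 instead of compressing the rest of the intensity range toward zero.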

Radiographic views labeling for PedXnets
We took the class-balanced ImageNet dataset 1 and its hierarchical labeling structure based on WordNet 27 as a reference. As shown in Fig. 2, a hierarchical structure could be constructed in the large-scale original pediatric dataset, with anatomical information for 7 major classes and radiographic views information for 68 classes.
In detail, to construct PedXnet-7C, we divided all pediatric radiographs into seven major anatomic areas of the human body: head, chest, upper extremity, abdomen, pelvis, spine, and lower extremity. To compose PedXnet-30C, we subdivided the 7 classes into 30 classes based on the detailed anatomic areas of the radiographs, and to configure PedXnet-68C, we further subdivided them into 68 classes using the radiographic protocols. As shown in Fig. 1, this radiographic views labeling revealed strong imbalances between classes in the collected raw original pediatric dataset. Thus, we built PedXnet-7C, PedXnet-30C, and PedXnet-68C by random sampling so that each dataset is balanced among classes and the total numbers of training data are comparable.
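The three labeling levels can be represented as a nested mapping. The sketch below encodes only the lower extremity example given in the Fig. 2 caption; the structure, the names `HIERARCHY` and `labels_for`, and the empty protocol lists for areas not spelled out in the text are illustrative assumptions.

```python
# Partial hierarchy taken from the Fig. 2 example; all other branches are omitted.
# Areas whose protocols are not listed in the text are left empty on purpose.
HIERARCHY = {
    "lower extremity": {
        "hip": [],
        "femur": [],
        "knee": [],
        "lower leg": [],
        "ankle": [],
        "foot": [
            "Foot AP",
            "Foot Calcaneus",
            "Foot Hindfoot",
            "Foot Lateral",
            "Foot Oblique",
        ],
        "toe": [],
    },
}


def labels_for(protocol):
    """Return the (7-class, 30-class, 68-class) labels for a protocol name."""
    for region, areas in HIERARCHY.items():
        for area, protocols in areas.items():
            if protocol in protocols:
                return region, area, protocol
    raise KeyError(protocol)


foot_lateral_labels = labels_for("Foot Lateral")
```

A single radiograph therefore carries one label per level, and each PedXnet dataset uses the label from its own level.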

Supervised representation learning using radiographic views label
We performed radiographic views recognition tasks as upstream tasks with PedXnet-7C, PedXnet-30C, and PedXnet-68C so that the model captures a representation of the radiographic views information of radiographs (see Fig. 3a). The models were trained to classify pediatric radiographs into their corresponding radiographic views classes. The classification task loss was defined as the cross-entropy loss (CE loss):
$$L_{CE} = -\sum_{i=1}^{M} y_i \log(\hat{y}_i)$$
where M is the number of classes, $\hat{y}$ is the M-dimensional vector of predicted probabilities, and y is the M-dimensional one-hot encoded ground truth. For the radiographic views classification tasks, we chose InceptionV3 4, a widely used CNN architecture since the ILSVRC 2015, because it is recognized for its performance and is often used in medical problems, e.g., detecting fractures 8,28 and BAA 10. In addition, according to Ke et al. 18, when transfer learning was performed on 14 radiological observation classification tasks on chest radiographs, the performance of InceptionV3 with ImageNet pre-trained weights was lower than without them. Thus, InceptionV3 was selected as the basic CNN architecture for finding a suitable representation for the medical domain. InceptionV3 has 11 convolution layers of 1 × 1, 1 × 3, and 1 × 5 kernels, and convolution blocks are applied along with max-pooling layers for down-sampling. All convolutional layers include batch normalization and rectified linear unit (ReLU) layers. In the upstream tasks, predictions are

Transfer learning for medical problems
To assess whether the radiographic views representations learned from PedXnets benefit medical applications, we conducted two pediatric downstream tasks: fracture classification and BAA (see Fig. 3b). First, the classification of fractures in the upper and lower extremities of pediatric radiographs is clinically important 29,30. Fractures can occur anywhere in the body and are frequent in childhood, occurring mainly in the upper and lower limbs. Therefore, the model should be able to recognize fracture features robustly across multiple radiographic views. This task evaluates the transferability of Model-PedXnets in a multi-view setting. Second, BAA in pediatric hand radiographs evaluates the transferability of Model-PedXnets in a single-view setting. For each downstream task, Model-Baseline, Model-ImageNet, Model-PedXnet-7C, Model-PedXnet-30C, and Model-PedXnet-68C were trained with the same training settings.
To solve the fracture classification task using transfer learning, the models should extract general features of fractures in the upper and lower extremities. We trained Model-Baseline from scratch and conducted transfer learning using Model-PedXnet-7C, Model-PedXnet-30C, Model-PedXnet-68C, and Model-ImageNet for the binary classification of fractures in the upper and lower extremities. The classification task loss is defined as the binary cross-entropy loss (BCE loss):
$$L_{BCE} = -\left[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right]$$
where $\hat{y}$ is the probability output by the model and y is the ground truth. The same preprocessing and augmentations as in the upstream task were applied; in addition, inspired by Parveen et al. 31, contrast limited adaptive histogram equalization (CLAHE) 32 was applied to emphasize bone contrast. For a fair comparison, the same batch size, optimizer, learning rate, and scheduler as in the upstream tasks were used, except for the total number of epochs, which was up to 300 for this task. However, the model selected for each experiment was the converged model with the highest validation score.
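The two classification losses named in the text, cross-entropy for the upstream views recognition task and binary cross-entropy for fracture classification, can be checked on small numbers with the sketch below. It is a plain-Python illustration of the standard definitions, not the authors' training code; the `eps` clamp is an added assumption for numerical safety.

```python
import math


def cross_entropy(y_true, y_prob, eps=1e-12):
    """CE loss for a one-hot ground truth vector and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_prob))


def binary_cross_entropy(y, p, eps=1e-12):
    """BCE loss for a single binary label y and predicted probability p."""
    p = min(max(p, eps), 1 - eps)  # keep log() away from 0
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# Three-class toy example: the true class is the second one.
ce = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])

# A confident correct prediction costs little; a confident wrong one costs more.
bce_correct = binary_cross_entropy(1, 0.9)
bce_wrong = binary_cross_entropy(0, 0.9)
```

With a one-hot target, only the probability assigned to the true class contributes to the CE loss, so `ce` equals the negative log of 0.7.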
BAA is mainly performed on hand radiographs, so to solve the bone age assessment task using transfer learning, the model should extract detailed features from the bones of the wrist, hand, and fingers in anteroposterior hand radiographs only. We trained Model-Baseline from scratch and conducted transfer learning using Model-PedXnet-7C, Model-PedXnet-30C, Model-PedXnet-68C, and Model-ImageNet for the regression of bone age from hand radiographs. The regression task loss is defined as the mean squared error loss (MSE loss):
$$L_{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
where N is the number of samples, $\hat{y}_i$ is the predicted bone age, and $y_i$ is the ground truth bone age.

Consent to participate
The requirement of patient informed consent was waived by the Institutional Review Board Committee of Asan Medical Center.

Upstream results of supervised radiographic views representation task
As shown in Table 1, when the epoch with the highest score on the fine-tuning set was selected and evaluated on the validation set, Model-PedXnet-7C, Model-PedXnet-30C, and Model-PedXnet-68C all performed well (F1 > 0.78, Accuracy > 0.90, Precision > 0.84, Recall > 0.79). These upstream results indicate that the Model-PedXnets learned the representation without overfitting, so the model weights of the upstream task could be applied to the downstream tasks. Figure 4 shows the activation maps of Model-PedXnet-7C, visualized using Grad-CAM. Model-PedXnet-7C was activated in the region of interest (ROI), and the activation maps demonstrate that it could capture clinically meaningful features. After pretraining, the Inception V3 model serves as a learnable feature extractor for the downstream tasks. It is initialized with its PedXnet pre-trained weights, excluding the last three layers that make up the fully connected layer. Downstream training then updates the entire model.

Downstream task result for fracture classification task
As presented in Table 2, the AUC scores of the Model-PedXnets were significantly higher than that of Model-Baseline. Notably, Model-PedXnet-30C demonstrated superior performance across all metrics except specificity and PPV. The features of the last InceptionV3 convolution layer for Model-Baseline, Model-PedXnet-30C, and Model-ImageNet were visualized using Grad-CAM to verify their representations. The radiographs selected for this visualization were randomly chosen from the test dataset. As illustrated in Fig. 5, Model-PedXnet gave the most concentrated depiction of the fracture lesion among the three models. We additionally validated the downstream fracture task using internal data from Asan Medical Center. For further details, please refer to the "Expansion of downstream task" section in the Supplementary Materials.
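For reference, the threshold-based metrics reported alongside AUC (sensitivity, specificity, PPV, and NPV) follow directly from the confusion matrix. The sketch below uses the standard definitions with a hypothetical function name and toy predictions; it is not the evaluation code used in the study.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix metrics for binary labels (1 = fracture)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # recall on fracture cases
        "specificity": ratio(tn, tn + fp),  # recall on non-fracture cases
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }


# Toy labels with a 2:1 fracture to non-fracture ratio (illustrative only).
toy_true = [1, 1, 1, 1, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 1]
toy_metrics = binary_metrics(toy_true, toy_pred)
```

With a 2:1 class ratio, PPV and NPV depend on prevalence as well as on the model, which is one reason a model can lead on AUC without leading on every thresholded metric.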

Downstream task result for bone age assessment task
Model-PedXnet-7C achieved the best performance in the BAA task, with an MAE of 5.245, an MSE of 42.857, and an R-square of 0.974 (Table 3). Model-PedXnet-7C and Model-PedXnet-30C showed improvements in MAE compared to the baseline model. Figure 6 indicates that Model-PedXnet-7C focused most intensively on the regions that are most important for predicting bone age, such as the carpus and metacarpophalangeal joints. The plotted radiographs were randomly chosen from the validation set. More plots of activation maps are provided in Supplementary Fig. 3.
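The three regression metrics reported for BAA can be computed as in the sketch below, which uses the standard definitions on toy bone ages in months. The function name and the toy values are assumptions for illustration. One useful property when reading such tables is that the MSE can never be smaller than the square of the MAE.

```python
def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R-square) for paired ground truth and predictions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    r_square = 1 - ss_res / ss_tot
    return mae, mse, r_square


# Toy bone ages in months (illustrative only, not challenge data).
toy_age_true = [100, 120, 140, 160]
toy_age_pred = [104, 118, 143, 155]
toy_mae, toy_mse, toy_r2 = regression_metrics(toy_age_true, toy_age_pred)
```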

Discussion
Transfer learning is widely used in medical tasks because the scarcity of cases and privacy protection issues make medical data difficult to collect. Most pediatric studies in particular rely on the ImageNet representation. However, whether the ImageNet representation is suitable for the medical domain is still debated 38. In this study, we constructed class-balanced pediatric datasets, PedXnets, and proposed the Model-PedXnets framework to reap the benefits of transfer learning in the medical domain. In Tables 2 and 3, the Model-PedXnets outperformed Model-Baseline by a large margin in the downstream pediatric tasks of fracture classification and bone age assessment. Although the PedXnets contain only approximately 70,000 images each and are far smaller than ImageNet, the Model-PedXnets performed as well as or better than Model-ImageNet. These findings suggest that data with medical content, even if not as large as ImageNet, is better suited to solving medical problems. To the best of our knowledge, this is the first study to demonstrate representation learning with pediatric radiographs and to compare the effects of transfer learning on two major pediatric tasks.
The differences were even more pronounced in the qualitative results shown by the activation maps. In the fracture downstream task, Model-ImageNet focused on minor local context, while Model-PedXnet focused on more medically meaningful ROIs. Compared with the other methods, Model-PedXnet concentrated accurately on the fracture site without being affected by casts. Supplementary Figs. 1 and 5 present radiographs of the upper and lower extremities in which Model-PedXnet appropriately highlights the fracture site across various radiographic views. For the BAA downstream task, hand radiographs and activation maps are presented in Supplementary Fig. 3 and Fig. 6. The activation maps changed dynamically with age, and the important areas for prediction were the carpus, thumb, and metacarpophalangeal joints 39. Model-PedXnet placed less emphasis on uninformative high-intensity content, such as L or R marks and film lines in the radiographs. As shown in Supplementary Figs. 2 and 4, there was no significant difference in ROI activity among the Model-PedXnets in the downstream tasks, because the Model-PedXnets training strategy was designed to extract contextual features of pediatric radiographs through radiographic views labeling, which could help the models understand the important pediatric regions of the radiographs. As shown in Supplementary Table 4, similar results were obtained in the ablation study in which the amount of training data for the downstream task was limited.
In addition, a comparison among the Model-PedXnets showed that the radiographic views representation learned from the dataset with fewer classes, Model-PedXnet-7C, was more effective, unlike the ImageNet representation with its 1000 diverse classes. Radiographs are essential tools for medical diagnosis, and because of the risks associated with radiation exposure, the protocols for their use are meticulously regulated. Pediatric radiographs in particular include views of various sizes depending on age. Dividing the data classes all the way down to the radiographic protocol could therefore group simple and almost identical images, making the pretext task highly trivial and decreasing the transfer learning effect. Classes also overlapped in some regions; for example, chest AP views were similar to abdomen AP views in newborns and infants, which would act as a kind of label noise. The network would then miss meaningful features, resulting in negative transfer. Additionally, the performance decline with more detailed classes may be due to the smaller amount of training data per class as the number of classes increases, likely leading to insufficient learning of the radiologic view in PedXnet-68C.
Despite the improved performance, our method has some limitations. First, because our proposed methodology depends on the backbone network and the pre-processing, different backbone networks and pre-processing could lead to substantially different results. We fixed InceptionV3 and the preprocessing because of the limited GPU resources in our study and leave this question open for discussion. Second, as shown in Fig. 1, we performed heavy random under-sampling of the raw original dataset to build class-balanced datasets according to the anatomical hierarchy of radiographic views. This reduced the total number of training data, and the amount of data may have been insufficient compared to ImageNet to show the full effect 40. Third, since radiographic views labeling may vary depending on the radiologist and our proposed method is supervised, the results could change considerably with the labeling method. Labeling the data classes from an anatomical or radiographic perspective can be somewhat subjective. Fourth, the pretraining model was trained and validated exclusively on pediatric data. The domain gap between upstream and downstream tasks is a critical factor in pre-training research, as it can significantly impact the effectiveness of pretraining models. The influence of pretraining with pediatric data on other medical datasets remains an area for future investigation. Our research is focused on demonstrating the effects of pre-training through supervised learning, using radiographic views labeling in pediatric data. Comparison with CheXNet 41, which is based on an adult chest X-ray dataset, is left as future work. Recent advancements in unsupervised learning and the growing need for pre-trained models tailored to medical domains have led to significant developments. Specifically, previous studies [42][43][44] have successfully developed pre-trained weights designed for reconstruction tasks, resulting in substantial performance improvements in dense prediction tasks. Building on these foundations, future work will aim to establish a more effective framework for unsupervised representation of radiographic views. Fifth, we verified the effect of pretraining exclusively through full fine-tuning on the downstream tasks. Investigating the pretraining effect via other transfer learning techniques, such as linear probing and gradual unfreezing, remains a subject for future research. Sixth, to improve the reliability and objective quantification of the Grad-CAM results, we intend to incorporate the implementation of a blind test

Conclusion
In this study, we introduced a supervised approach to medical representation learning for pediatric tasks using radiographic views labels. First, we designed class-balanced pediatric radiograph datasets (PedXnets) through radiographic views labeling. Then, using the PedXnets, we conducted representation learning helpful for pediatric problems through a supervised radiographic views classification task. According to the evaluation results, the representation of major anatomical information was effective, and the transfer effect of Model-PedXnet was positive in both pediatric downstream tasks, fracture classification and bone age assessment. The Model-PedXnets outperformed Model-Baseline by a large margin and performed as well as or better than Model-ImageNet even though the PedXnets are smaller than ImageNet. In addition, the proposed representation learning allowed networks to capture more semantic features in the ROI of radiographs. Our study could be helpful for medical domains in which data are difficult to collect, particularly pediatric radiograph research, so we aim to disclose the PedXnet weights.

Figure 1. Flow chart of the processing of the upstream dataset with the proposed radiographic views labeling in a real-world medical radiograph dataset. Sampling was performed independently to build balanced datasets for each type of radiographic views labeling. For the upstream validation set, the same radiographic views labeling was applied to the fixed dataset after a reference date (July 2018), so the class-wise mean and variance per labeling type were different. N the total number of the data.

Figure 2. Overview of the radiographic views labeling process for PedXnets. For example, the lower extremity class in the 7-class labeling can be divided into several classes in the 30-class labeling: hip, femur, knee, lower leg, ankle, foot, and toe. These can be further divided into subclasses in the 68-class labeling based on the protocol code of the radiographs: Foot AP, Foot Calcaneus, Foot Hindfoot, Foot Lateral, and Foot Oblique. PA posteroanterior, AP anteroposterior, KUB kidney, ureter, and bladder, PNS paranasal sinus, SI sacroiliac.

Figure 3. Overview of the Model-PedXnets framework. The framework consists of upstream and downstream tasks. In the upstream tasks, radiographic views recognition of pediatric radiographs serves as a pretext task for developing pre-trained models. In the downstream tasks, transfer learning with the pre-trained weights is used to solve two medical problems, fracture classification and bone age assessment. GAP global average pooling, ReLU rectified linear unit.

Figure 4. Grad-CAM activation maps of Model-PedXnet-7C for the radiographic views recognition task (upstream task) on independent validation sets according to each labeling method. Grad-CAM gradient-weighted class activation mapping.

Figure 5. Comparisons of activation maps in the intermediate layer of Model-PedXnet, Model-Baseline, and Model-ImageNet using Grad-CAM in the fracture downstream task. Grad-CAM gradient-weighted class activation mapping, Model-Baseline scratch model, Model-PedXnet the model pretrained from PedXnet-7class task, Model-ImageNet the model pretrained from ImageNet.

Figure 6. Plots of activation maps in the intermediate layer of Model-ImageNet, Model-Baseline, and Model-PedXnet using the channel-wise mean activation map in the BAA task. Please refer to Supplementary Fig. 4 for the activation maps according to PedXnet types. The sample in the first row is 152 months old, and the sample in the second row is 167 months old. The carpus and metacarpophalangeal joints are critical regions for bone age assessment 39. BAA bone age assessment, Model-Baseline scratch model, Model-PedXnet the model pretrained from PedXnet-7class task, Model-ImageNet the model pretrained from ImageNet.

Table 1. The performance comparisons of the radiographic views recognition task as an upstream task.

Table 2. The performance comparisons of the fracture classification task. DeLong's test method was adopted for pairwise ROC comparison between the baseline and each model. Mean of all reader group is shown with 95% confidence interval. ROC receiver operating characteristic, AUC area under the ROC curve, PPV positive predictive value, NPV negative predictive value, Model-Baseline scratch model, Model-PedXnet-7C the model pretrained from PedXnet-7class task, Model-PedXnet-30C the model pretrained from PedXnet-30class task, Model-PedXnet-68C the model pretrained from PedXnet-68class task, Model-ImageNet the model pretrained from ImageNet. *p < 0.05. Significant values are in bold.

Table 3. The performance comparisons of bone age assessment. Paired t-test method was adopted for MAE comparison between the baseline and each model. stdev. standard deviation, MAE mean absolute error, MSE mean squared error, Model-Baseline scratch model, Model-PedXnet-7C the model pretrained from PedXnet-7class task, Model-PedXnet-30C the model pretrained from PedXnet-30class task, Model-PedXnet-68C the model pretrained from PedXnet-68class task, Model-ImageNet the model pretrained from ImageNet. Significant values are in bold.